Languages with more speakers tend to be harder to (machine-)learn

Computational language models (LMs), most notably exemplified by the widespread success of OpenAI's ChatGPT chatbot, show impressive performance on a wide range of linguistic tasks, thus providing cognitive science and linguistics with a computational working model to empirically study different aspects of human language. Here, we use LMs to test the hypothesis that languages with more speakers tend to be easier to learn. In two experiments, we train several LMs—ranging from very simple n-gram models to state-of-the-art deep neural networks—on written cross-linguistic corpus data covering 1293 different languages and statistically estimate learning difficulty. Using a variety of quantitative methods and machine learning techniques to account for phylogenetic relatedness and geographical proximity of languages, we show that there is robust evidence for a relationship between learning difficulty and speaker population size. However, contrary to expectations derived from previous research, our results suggest that languages with more speakers tend to be harder to learn.
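To make the learning-difficulty measurement concrete, the following minimal sketch computes the average number of bits per symbol that a very simple character-level LM, a bigram model with add-one smoothing, needs to predict held-out text. This is the kind of quantity plotted in Supplementary Figure 1; the paper's difficulty measure is then derived from how this quantity falls as the amount of training data grows. The sketch is an illustrative toy, not the paper's actual pipeline, and the function name and toy strings are invented for the example.

```python
import math
from collections import Counter

def bits_per_symbol(train: str, test: str) -> float:
    """Cross-entropy (bits per character) of `test` under a character bigram
    model with add-one smoothing estimated from `train`."""
    vocab = set(train) | set(test)
    V = len(vocab)
    bigrams = Counter(zip(train, train[1:]))
    contexts = Counter(train[:-1])
    total_bits = 0.0
    for prev, cur in zip(test, test[1:]):
        # Add-one smoothed conditional probability P(cur | prev).
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + V)
        total_bits += -math.log2(p)
    return total_bits / (len(test) - 1)

# Toy illustration: a near-deterministic source vs. a more varied one.
print(bits_per_symbol("ab" * 400, "ab" * 80))
print(bits_per_symbol("the dog saw the cat and the cat saw the dog " * 20,
                      "the dog saw the cat "))
```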

Supplementary Figure 1 | Illustration of measuring learning difficulty in Study 1. Orange circles represent observed bits-per-symbol for a synthetic dataset for a "language" with two characters ('a', 'b') that are randomly emitted ('Coin flip'). The underlying entropy rate is therefore h = log2(2) = 1.00. Orange lines represent fitted values based on the ansatz function. As can be seen, the extrapolated entropy rate is very close to the theoretical expectation. Blue circles represent observed bits-per-symbol that are needed (on average) to encode/predict symbols based on increasing amounts of training data for a real document, here a translation of the Bible into Uyghur on the level of characters ('uig-x-bible-romanized_parents'). The blue line represents fitted values based on the ansatz function. The extrapolated entropy rate is h = 1.00. The b parameter describes the shape of the fitted curves, where higher values indicate fast convergence and thus lower learning difficulty. For the coin flip data, convergence is much faster, thus b = 1.12, whereas for the Uyghur text, the LM needs comparatively more training data and convergence to the underlying source entropy is rather slow, thus b = 0.38.
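The extrapolation described in the caption can be reproduced in a few lines. Below is a minimal curve-fitting sketch assuming an ansatz of the form f(n) = h + A·n^(−b), where n is the amount of training data; the exact functional form used in the paper may differ. The intercept h is the extrapolated entropy rate, and b governs how quickly the observed bits-per-symbol approach it (higher b, faster convergence). The data points are synthetic and generated for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def ansatz(n, h, A, b):
    # Assumed form: bits-per-symbol decays towards the entropy rate h as a power law in n.
    return h + A * n ** (-b)

# Synthetic measurements: average bits-per-symbol after training on n symbols.
n = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4, 6.4e4, 1.28e5])
bps = 1.0 + 2.5 * n ** (-0.38) + np.random.default_rng(0).normal(0, 0.005, n.size)

(h, A, b), _ = curve_fit(ansatz, n, bps, p0=[1.0, 1.0, 0.5])
print(f"extrapolated entropy rate h = {h:.2f} bits/symbol, convergence parameter b = {b:.2f}")
```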

Supplementary Table 1 | Multilevel mixed-effects linear regression results (Study 1)
Column 1: Language model. Column 2: Symbol type. Column 3: Estimated effect of speaker population size, βLMER. Listed are models with the lowest AIC per LM and per symbol from a total of 2,430 models that include a fixed effect (and potential random slopes) for speaker population size. To control for the potential non-independence of datapoints due to phylogenetic relatedness and geographical proximity, all models additionally include fixed covariates, random intercepts and random slopes. Standard errors and p-values are given in brackets. Column 4: Difference-in-AIC between reduced and full models. Reduced models do not include a fixed effect (and potential random slopes) for speaker population size. Column 5: Percentage of cases where the full model has a better fit (i.e. lower AIC) than the corresponding reduced model. Models that do not include a fixed effect for speaker population size but potential random slopes for speaker population size are excluded. Column 6: Percentage of cases where the full model has a better fit than the corresponding reduced model. Here, models where βLMER is constrained to be zero are included. Columns 7-9: Selected control covariates for the model with the lowest AIC per LM and symbol. N = 3,853. Statistical significance is determined based on parametric tests. See Methods for further information on statistical methods.

Supplementary Table 6 | Bayesian linear regression results
Column 1: Version ('all' or only fully 'parallel'). We first ran double-selection lasso linear regressions with the log of b as the outcome, the log of speaker population size as the covariate of interest, and a set of potential controls that consists of the same variables as the small set described in the main part of the paper plus, since it is not possible to cluster variables at the level of individual languages for Bayesian models, indicator variables for the levels of language. The set consists of Nc = 1,505 variables for the 'all' version and Nc = 1,457 variables for the 'parallel' version. To select the optimal value for the penalty parameter for each lasso, we use a plugin iterative formula. Column 3: Language model. Column 4: Symbol type. Column 5: Number of controls selected by lasso. We then fitted Bayesian linear regressions with the log of b as the outcome and the log of speaker population size plus the selected controls as the covariates. We used a burn-in period of 100,000 iterations, normal priors with mean 0 and variance of 10,000 for the regression coefficients, and an inverse-gamma prior with shape and scale parameters of 0.01 for the error variance. We used Gibbs sampling and simulated four chains for each model. Column 5: The Gelman-Rubin convergence diagnostic 1,2, Rc, is given in parentheses for each model. All values are below 1.2, indicating convergence. Column 6: 95% credible interval for the estimated coefficient of speaker population size. Column 7: Estimated posterior probability that the coefficient of speaker population size is negative (in %).
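To make the Bayesian machinery described above tangible, here is a minimal, self-contained Gibbs sampler for a linear regression with the stated priors (normal with mean 0 and variance 10,000 on the coefficients, inverse-gamma with shape and scale 0.01 on the error variance), four chains, a basic Gelman-Rubin check, a 95% credible interval, and the posterior probability of a negative coefficient. It is an illustrative sketch on simulated data, not the code behind the reported models; the function names are invented and the 100,000-iteration burn-in is shortened.

```python
import numpy as np

def gibbs_linreg(X, y, n_draws=2000, burn_in=1000, prior_var=10_000.0,
                 ig_shape=0.01, ig_scale=0.01, seed=0):
    """Gibbs sampler for y = X @ beta + eps with eps ~ N(0, sigma2).
    Priors: beta_j ~ N(0, prior_var), sigma2 ~ Inverse-Gamma(ig_shape, ig_scale)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, sigma2 = np.zeros(p), 1.0
    XtX, Xty = X.T @ X, X.T @ y
    kept = np.empty((n_draws, p))
    for it in range(burn_in + n_draws):
        # Draw beta | sigma2, y from its conditional (multivariate normal) posterior.
        V = np.linalg.inv(XtX / sigma2 + np.eye(p) / prior_var)
        mu = V @ (Xty / sigma2)
        beta = rng.multivariate_normal(mu, V)
        # Draw sigma2 | beta, y from its conditional inverse-gamma posterior.
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(ig_shape + n / 2.0, 1.0 / (ig_scale + resid @ resid / 2.0))
        if it >= burn_in:
            kept[it - burn_in] = beta
    return kept

def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter; `chains` has shape
    (n_chains, n_draws_per_chain)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

# Simulated example: the outcome depends negatively on the covariate of interest.
rng = np.random.default_rng(1)
N = 400
x_interest = rng.normal(size=N)  # stand-in for log speaker population size
X = np.column_stack([np.ones(N), x_interest, rng.normal(size=(N, 2))])
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.5, size=N)

chains = [gibbs_linreg(X, y, seed=s)[:, 1] for s in range(4)]  # 4 chains, coefficient of interest
draws = np.concatenate(chains)
print("R-hat:", round(gelman_rubin(np.array(chains)), 3))
print("95% credible interval:", np.percentile(draws, [2.5, 97.5]).round(3))
print("P(coefficient < 0):", round((draws < 0).mean() * 100, 1), "%")
```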
… 4 in the main part of the paper. Columns 3-8: Results for the NT version (N = 504). Columns 9-14: Results for the OT version (N = 138). Columns 3,9: Estimated fixed effect of speaker population size, βLMER. Listed are models with the lowest AIC per LM and per symbol from a total of 728 models that include a fixed effect (and potential random slopes) for speaker population size. Standard errors are given in brackets. Columns 4,10: Difference-in-AIC between reduced and full models. Columns 5,11: Percentage of cases where the full model has a better fit (i.e. lower AIC) than the corresponding reduced model (here, models that do not include a fixed effect for speaker population size but potential random slopes for speaker population size are excluded). Columns 6,12: Percentage of cases where the full model has a better fit than the corresponding reduced model (here, models where βLMER is constrained to be zero are included).

Supplementary Table 2 |
Multilevel mixed-effects linear regression results (Study 1). See Supplementary Table 1 for a description of the column content. Here, we include only documents from fully parallel corpora (N = 3,224). All estimated β-coefficients are significant at p < .05, except for LSTMcomp as LM on the level of characters, where p = .090.
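The model-comparison logic behind these mixed-effects tables, a 'full' model with a fixed effect of log speaker population size versus a 'reduced' model without it, compared by AIC, can be sketched as follows. This is a schematic with simulated data and a single random intercept for a placeholder 'family' grouping; the paper's models include further covariates, random slopes and a much larger candidate model space, and all variable names below are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: one row per document, grouped by language family.
rng = np.random.default_rng(2)
n_fam, docs_per_fam = 40, 10
fam = np.repeat(np.arange(n_fam), docs_per_fam)
fam_effect = rng.normal(0, 0.3, n_fam)[fam]            # random intercepts by family
log_pop = rng.normal(12, 3, n_fam * docs_per_fam)
log_b = -0.6 + 0.02 * log_pop + fam_effect + rng.normal(0, 0.2, n_fam * docs_per_fam)
df = pd.DataFrame({"log_b": log_b, "log_pop": log_pop, "family": fam})

def aic(res):
    # AIC = -2 * log-likelihood + 2 * number of estimated parameters
    # (fixed effects + random-effects covariance terms + residual variance);
    # models are fitted with reml=False so their likelihoods are comparable.
    k = len(res.fe_params) + res.cov_re.shape[0] * (res.cov_re.shape[0] + 1) // 2 + 1
    return -2 * res.llf + 2 * k

full = smf.mixedlm("log_b ~ log_pop", df, groups=df["family"]).fit(reml=False)
reduced = smf.mixedlm("log_b ~ 1", df, groups=df["family"]).fit(reml=False)
print("AIC full:", round(aic(full), 1), " AIC reduced:", round(aic(reduced), 1))
print("difference-in-AIC (reduced - full):", round(aic(reduced) - aic(full), 1))
```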

Supplementary Table 3 |
Double-selection lasso linear regression results (Study 1). Column 1: Set of potential candidate variables to be included as controls in each model (number of controls given in brackets). Column 2: Language model. Column 3: Symbol type. Column 4: Number of cases. Column 5: Number of controls selected by lasso. Columns 6, 7: Out-of-sample R² of learning difficulty (column 6)/speaker population size (column 7) on the control variables selected by the lasso. Column 8: Estimated coefficient for the effect of speaker population size, βDS. Robust standard errors (clustered at the level of individual languages) and p-values are given in brackets. *p < 0.05, **p < 0.01, ***p < 0.005. Statistical significance is determined based on non-parametric permutation tests (see Methods for further information on statistical methods).
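Double selection itself is a short recipe: one lasso selects controls that predict the outcome (log b), a second lasso selects controls that predict the covariate of interest (log speaker population size), and a final OLS regresses the outcome on the covariate of interest plus the union of the selected controls, with standard errors clustered by language. The sketch below illustrates this on simulated placeholder data and uses scikit-learn's cross-validated LassoCV penalty rather than the plugin formula used in the paper.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 500, 60
controls = rng.normal(size=(n, p))                      # candidate control variables
lang_id = rng.integers(0, 100, size=n)                  # cluster variable (language)
log_pop = controls[:, :3] @ np.array([0.5, -0.4, 0.3]) + rng.normal(size=n)
log_b = 0.03 * log_pop + controls[:, :2] @ np.array([0.2, -0.1]) + rng.normal(scale=0.3, size=n)

# Step 1: lasso of the outcome on the controls.
sel_y = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(controls, log_b).coef_)
# Step 2: lasso of the covariate of interest on the controls.
sel_d = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(controls, log_pop).coef_)
selected = sorted(set(sel_y) | set(sel_d))              # union of selected controls

# Step 3: OLS of the outcome on the covariate of interest plus the selected controls,
# with cluster-robust standard errors at the language level.
X = sm.add_constant(np.column_stack([log_pop, controls[:, selected]]))
res = sm.OLS(log_b, X).fit(cov_type="cluster", cov_kwds={"groups": lang_id})
print("selected controls:", len(selected))
print("beta_DS =", round(res.params[1], 4), " SE =", round(res.bse[1], 4))
```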

Supplementary Table 4 |
Double-selection lasso linear regression results (Study 1). See Supplementary Table 3 for details. Here, only fully parallel corpora are considered.

Supplementary Table 5 |
Cross-fit partialing-out lasso linear regression (Study 1). Column 1: Version ('all' or only fully 'parallel'). Column 2: Set of potential candidate variables to be included as controls in each model (number of controls given in brackets). Column 3: Language model. Column 4: Symbol type. Column 5: Number of controls selected by lasso. Column 6: Estimated coefficient for the effect of speaker population size, βXPO. Robust standard errors clustered at the level of individual languages are given in brackets. *p < 0.05, **p < 0.01, ***p < 0.005. Statistical significance is determined based on parametric tests. To select the optimal value for the penalty parameter for each lasso, we use a plugin iterative formula.
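Cross-fit partialling-out differs from double selection in that the outcome and the covariate of interest are each residualised on out-of-fold lasso predictions, and the coefficient is then estimated from the residualised variables. A minimal sketch of the cross-fitting idea follows; it again uses a cross-validated penalty and simulated placeholder data rather than the plugin formula and the real corpus variables, and for brevity reports heteroskedasticity-robust rather than language-clustered standard errors.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
n, p = 500, 60
controls = rng.normal(size=(n, p))
log_pop = controls[:, :3] @ np.array([0.5, -0.4, 0.3]) + rng.normal(size=n)
log_b = 0.03 * log_pop + controls[:, :2] @ np.array([0.2, -0.1]) + rng.normal(scale=0.3, size=n)

# Cross-fitting: residualise the outcome and the covariate of interest using
# lassos that were fitted on the other folds only.
res_y, res_d = np.empty(n), np.empty(n)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(controls):
    ly = LassoCV(cv=5, random_state=0).fit(controls[train_idx], log_b[train_idx])
    ld = LassoCV(cv=5, random_state=0).fit(controls[train_idx], log_pop[train_idx])
    res_y[test_idx] = log_b[test_idx] - ly.predict(controls[test_idx])
    res_d[test_idx] = log_pop[test_idx] - ld.predict(controls[test_idx])

# Final step: regress the residualised outcome on the residualised covariate of interest.
fit = sm.OLS(res_y, sm.add_constant(res_d)).fit(cov_type="HC1")
print("beta_XPO =", round(fit.params[1], 4), " SE =", round(fit.bse[1], 4))
```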

Supplementary Table 7 |
Multilevel mixed-effects linear regression results (Study 2). Column 1: Language model. Column 2: Symbol type. The table shows the random effects/slopes structure for the selected models listed in Table 4 in the main part of the paper. Columns 7, 8: Random effects/slopes structure for the NT version. Columns 13, 14: Random effects/slopes structure for the OT version. *p < 0.05, **p < 0.01, ***p < 0.005. Statistical significance is determined based on parametric tests. See Methods for further information on statistical methods.